Accuracy of Chatbots in Citing Journal Articles

This cross-sectional study quantifies the journal article citation error rate of an artificial intelligence chatbot.


Introduction
The recently released generative pretrained transformer chatbot ChatGPT from OpenAI has shown unprecedented capabilities ranging from answering questions to composing new content. 1 Its potential applications in health care and education are being explored 2 and debated. 3 Researchers and students may use it as a copilot in research. It excels at creating new content but falls short in providing scientific references. Journals such as Science have banned chatbot-generated text in their published reports. 4 However, the accuracy of reference citing by ChatGPT is unclear; therefore, this investigation aimed to quantify ChatGPT's citation error rate.

Methods
This study tested the value of the ChatGPT copilot in creating content for training in learning health systems (LHS). 5 A wide range of LHS topics was discussed with the latest GPT-4 model from OpenAI from April 20 to May 6, 2023. We used prompts for broad topics, such as LHS and data, as well as specific topics, such as building a stroke risk prediction model using the XGBoost library (Table 1).
Since chatbot responses depended on the prompts, we first asked questions about specific LHS topics, then requested journal articles as references. This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.
We verified each cited journal article by checking its existence in the cited journal and by searching its title using Google Scholar. The article's title, authors, publication year, volume, issue, and pages were compared. Any article that failed this verification was considered fake. To determine a reliable error rate, over 300 article references were produced on the LHS topics. For comparison, we chatted with OpenAI's default GPT-3.5 model for the same LHS topics. Exact 95% CIs for error rate

Table 2). The error rate of reference citing for GPT-4 was significantly lower than that for GPT-3.5 (P < .001) but remains nonnegligible. Narrower topics tended to have more fake articles than broader topics.
GPT-4 provided answers that could be used as supplementary materials for LHS training after fact-checking and editing. However, it failed to provide information about the latest LHS developments.
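For readers who wish to reproduce an exact 95% CI for a citation error rate, the following is a minimal sketch. It assumes that "exact" refers to the Clopper-Pearson binomial interval; the text does not name the method. The function names and the counts in the usage example are hypothetical and are not the study's data.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_ci(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Clopper-Pearson interval for k fake references out of n checked.

    Assumption: this is the 'exact' method; the source does not say so.
    Intended for modest n (a few hundred), where comb(n, i) fits in a float.
    """
    if not 0 <= k <= n or n <= 0:
        raise ValueError("require n > 0 and 0 <= k <= n")

    def solve(f, target: float) -> float:
        # f is increasing in p on [0, 1]; bisect for f(p) = target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha / 2 (defined as 0 when k == 0).
    lower = 0.0 if k == 0 else solve(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2
    )
    # Upper bound: P(X <= k | p) = alpha / 2 (defined as 1 when k == n).
    upper = 1.0 if k == n else solve(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2
    )
    return lower, upper


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    fake, total = 20, 100
    low, high = exact_ci(fake, total)
    print(f"error rate {fake / total:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

Bisection is used here only to avoid a dependency on a statistics library; any routine that inverts the binomial distribution should give the same bounds.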

Discussion
Our findings suggest that GPT-4 can be a helpful copilot in preparing new LHS education and training materials, although it may lack the latest information. Because GPT-4 cites some fake journal articles, every reference it provides must be verified manually by humans; GPT-3.5-cited references should not be used.
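The field-by-field comparison described in the Methods can be sketched as follows. The sketch assumes a human has already looked up the published record; only the comparison step is shown. The `Reference` class, the function names, and the normalization rule are hypothetical illustrations, not the procedure used in the study.

```python
import re
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class Reference:
    """The six fields compared in the Methods (hypothetical container)."""
    title: str
    authors: str
    year: str
    volume: str
    issue: str
    pages: str


def normalize(value: str) -> str:
    """Lowercase, turn punctuation into spaces, and collapse whitespace.

    Assumption: differences in case, punctuation, or dash style
    (e.g. '10-20' vs '10–20') should not count as a mismatch.
    """
    return " ".join(re.sub(r"[^\w\s]", " ", value.lower()).split())


def mismatched_fields(cited: Reference, found: Reference) -> List[str]:
    """Return the names of fields that differ after normalization."""
    return [
        f.name
        for f in fields(Reference)
        if normalize(getattr(cited, f.name)) != normalize(getattr(found, f.name))
    ]


def is_fake(cited: Reference, found: Optional[Reference]) -> bool:
    """A citation is fake if no record was found or any field differs."""
    return found is None or bool(mismatched_fields(cited, found))
```

Passing `None` as `found` represents an article that could not be located in the cited journal or through a title search, which under the rule above counts as fake.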
When asked why it returned fake references, ChatGPT explained that the training data may be unreliable, or the model may not be able to distinguish between reliable and unreliable sources. As generative chatbots are deployed as copilots in health care education and training, understanding their unique abilities (eg, the ability to answer any question) and inherent defects (eg, the inability to fact-check responses) will help users apply the new GPT technology more effectively to improve health care education and training. Additionally, potential ethical issues such as misinformation and data bias should be considered for GPT applications.
This study has some limitations, such as the chat topics not representing all subject areas. However, since the LHS topics covered many subject areas of health care, the findings should be applicable in the health care domain. Furthermore, the findings should be more applicable to deeper discussions with ChatGPT as opposed to superficial discussions.